Installing and Loading Necessary Packages

First, we install and load the required packages: dplyr for data manipulation, tidyverse (which bundles ggplot2 for plotting and also attaches dplyr), and plotly for interactive visualization.

# install.packages("dplyr")
# install.packages("tidyverse")
# install.packages("plotly")
library(dplyr)
library(tidyverse)
library(plotly)
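
Instead of commenting the install lines in and out, one option is to install only what is missing. The following is a minimal sketch of that idea; the helper name missing_packages is hypothetical and not part of any package, and the install step is guarded so it only runs in an interactive session.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("dplyr", "tidyverse", "plotly")
to_install <- missing_packages(required)

# Install only when something is missing and a user is present
# (install.packages() needs network access).
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```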

Reading and Viewing the Titanic Dataset

We start by reading the Titanic dataset from a CSV file into a data frame.

titanic_data <- read.csv("Titanic-Dataset.csv")

You can view the dataset in a spreadsheet-like format:

View(titanic_data)
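
By default, read.csv() keeps blank text fields as empty strings and only turns blank numeric fields into NA. If you would rather have every blank as NA, the na.strings argument can do that at read time. The sketch below uses a three-row inline CSV with illustrative values in place of the real file; note that the cleaning steps in this document assume the default empty strings.

```r
# Inline stand-in for the CSV file; the rows are illustrative only.
csv_text <- "PassengerId,Age,Cabin,Embarked
1,22,,S
2,38,C85,
3,,E46,Q"

# Default behaviour: blank text fields stay "", blank numeric fields become NA.
raw <- read.csv(text = csv_text)

# Alternative: treat blank fields as NA in every column.
with_na <- read.csv(text = csv_text, na.strings = c("", "NA"))

head(with_na)            # console alternative to View()
colSums(is.na(raw))      # NA counts per column with the defaults
colSums(is.na(with_na))  # NA counts per column with na.strings set
```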

Summary Statistics and Data Structure

To understand the dataset better, we run summary statistics and examine its structure.

summary(titanic_data)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 
str(titanic_data)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
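
Note that summary() reports NA counts only for Age; for character columns it prints just Length, Class, and Mode, so empty strings such as the "" values visible in Cabin go uncounted. A small sketch of one way to count both kinds of missing value per column follows. The helper name count_missing and the toy data frame are assumptions made for illustration.

```r
# Toy data frame with illustrative values, not rows from the dataset.
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c("", "C85", "", "C123"),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count NA values and empty strings in each column.
count_missing <- function(df) {
  vapply(df, function(col) sum(is.na(col) | col == ""), integer(1))
}

count_missing(toy)
```

Calling count_missing(titanic_data) on the real data would give the same kind of per-column tally.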

Data Cleaning

1. Fill Missing Age Values

The summary above shows 177 missing values in the Age column. We replace them with the median age.

titanic_data <- titanic_data %>%
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age))

2. Remove Rows with Missing Embarked Values

We filter out rows where the Embarked column is an empty string.

titanic_data <- titanic_data %>%
  filter(Embarked != "")

3. Drop the Cabin Column

The Cabin column has many missing values (they appear as "" in the str() output above), so we drop it from the dataset.

titanic_data <- titanic_data %>%
  select(-Cabin)

4. Convert Columns to Appropriate Data Types

We convert the categorical columns to factors, which R treats as discrete groups in summaries and plots.

titanic_data <- titanic_data %>%
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         Embarked = as.factor(Embarked))

5. Rename Columns

We change the column names to lowercase for consistency. The plotting code later in this document refers to these lowercase names.

names(titanic_data) <- tolower(names(titanic_data))

6. Save the Cleaned Data

Finally, we save the cleaned data to a new CSV file.

write.csv(titanic_data, "cleaned_titanic_data.csv", row.names = FALSE)
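
To check that the cleaning steps behave as intended, they can be run on a small data frame where the result is easy to verify by hand. The sketch below chains the same steps on four made-up rows (not taken from the dataset) and assumes dplyr is installed.

```r
library(dplyr)

# Toy input: one missing age, one empty Embarked value.
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 3, 2),
  Sex      = c("male", "female", "female", "male"),
  Age      = c(22, NA, 26, 40),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "Q"),
  stringsAsFactors = FALSE
)

cleaned <- toy %>%
  # The median is taken over the non-missing ages (22, 26, 40).
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
  # Drop the row whose Embarked value is an empty string.
  filter(Embarked != "") %>%
  select(-Cabin) %>%
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         Embarked = as.factor(Embarked))

names(cleaned) <- tolower(names(cleaned))

str(cleaned)
```

One detail worth noticing is the order: the median is computed before the Embarked filter runs, so rows that are later dropped still contribute to the imputed value.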

Titanic Dataset Visualization

Overview

The Titanic dataset contains detailed information about the passengers aboard the Titanic, including their age, class, fare, survival status, and more. Through these visualizations, we aim to uncover insights into the factors that influenced survival and the demographic composition of the passengers.

Survival Count Bar Plot

This plot displays the count of passengers who survived versus those who did not.

ggplot(titanic_data, aes(x = survived)) +
  geom_bar() +
  xlab("Survived") +
  ylab("Count") +
  ggtitle("Count of Survived Passengers on the Titanic")
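
Because survived is a factor with levels 0 and 1, the x-axis of this plot shows those codes rather than words. One option is to relabel the levels before plotting. The sketch below does this on a toy vector; the label text is an arbitrary choice.

```r
# Toy factor with the same 0/1 coding as the survived column.
survived <- factor(c(0, 1, 1, 0, 0))

# Relabel the levels so tables and plot axes read naturally.
survived_labelled <- factor(
  survived,
  levels = c("0", "1"),
  labels = c("Did not survive", "Survived")
)

# The counts that geom_bar() would draw, as numbers.
table(survived_labelled)
```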

Age Distribution Histogram

This histogram shows the distribution of passengers’ ages with bins of 5 years.

ggplot(titanic_data, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  xlab("Age") +
  ylab("Count") +
  ggtitle("Age Distribution of Passengers")
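
Since missing ages were filled with the median during cleaning, the bin containing the median will be taller than the recorded ages alone would make it: the summary reported 177 NA's, so up to that many rows now share one value. One way to keep track of this is to record a flag before imputing. The sketch below shows the idea on a toy vector; the name age_imputed is an assumption.

```r
# Toy ages with two missing values.
ages <- c(22, NA, 26, NA, 54, 2)

# Record which entries are missing before they are overwritten.
age_imputed <- is.na(ages)

# Same imputation rule as in the cleaning step.
ages_filled <- ifelse(age_imputed, median(ages, na.rm = TRUE), ages)

sum(age_imputed)   # number of imputed entries
ages_filled
```

If such a flag were stored as a column, it could be mapped to the fill aesthetic of the histogram to separate recorded from imputed ages.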

Boxplot of Age by Survival Status

This boxplot compares the age distribution of passengers who survived and those who did not.

ggplot(titanic_data, aes(x = survived, y = age)) +
  geom_boxplot() +
  xlab("Survived") +
  ylab("Age") +
  ggtitle("Age Distribution by Survival Status")
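
The numbers behind a boxplot (the median and the quartiles of each group) can also be computed directly. This is a minimal base R sketch on made-up values; the same calls would work on the age and survived columns of the cleaned data.

```r
# Toy data with the same column names as the cleaned dataset.
toy <- data.frame(
  survived = factor(c(0, 0, 0, 1, 1, 1)),
  age      = c(20, 30, 40, 10, 25, 28)
)

# Median age per survival group (the line inside each box).
group_median <- tapply(toy$age, toy$survived, median)

# Interquartile range per group (the height of each box).
group_iqr <- tapply(toy$age, toy$survived, IQR)

group_median
group_iqr
```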

Violin Plot of Age by Survival Status

This violin plot provides a more detailed look at the age distribution by survival status, including the density of the distribution.

ggplot(titanic_data, aes(x = survived, y = age)) +
  geom_violin() +
  xlab("Survived") +
  ylab("Age") +
  ggtitle("Age Distribution by Survival Status")

Passenger Class Distribution Bar Plot

This bar plot illustrates the distribution of passengers across different classes.

ggplot(titanic_data, aes(x = pclass)) +
  geom_bar(fill = "red") +
  xlab("Passenger Class") +
  ylab("Count") +
  ggtitle("Count of Passengers by Class")

Embarkation Point Distribution Bar Plot

This bar plot shows how many passengers embarked from each point (C = Cherbourg, Q = Queenstown, S = Southampton).

ggplot(titanic_data, aes(x = embarked)) +
  geom_bar(fill = "purple") +
  xlab("Embarkation Point") +
  ylab("Count") +
  ggtitle("Count of Passengers by Embarkation Point")

Scatter Plot of Age vs. Fare

This scatter plot examines the relationship between passengers’ ages and the fares they paid.

ggplot(titanic_data, aes(x = age, y = fare)) +
  geom_point(color = "red") +
  xlab("Age") +
  ylab("Fare") +
  ggtitle("Scatter Plot of Age vs. Fare")
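
A scatter plot can be complemented with a single number for the strength of the relationship. Fare is strongly right-skewed (a median of 14.45 against a maximum of 512.33 in the summary), so a rank-based measure such as Spearman's correlation is one reasonable choice because it is unaffected by monotone rescaling. The sketch below uses made-up values and is not a result from the dataset.

```r
# Toy values for illustration only.
toy <- data.frame(
  age  = c(22, 38, 26, 35, 54, 2),
  fare = c(7.25, 71.28, 7.92, 53.10, 51.86, 21.08)
)

# Rank correlation between age and fare.
rho <- cor(toy$age, toy$fare, method = "spearman")

# log1p() is monotone and defined at a fare of 0, so the ranks,
# and therefore the Spearman correlation, do not change.
rho_log <- cor(toy$age, log1p(toy$fare), method = "spearman")

rho
rho_log
```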

Facet Grid Scatter Plot of Age vs. Fare by Survival Status

This plot adds a facet grid to the scatter plot to separate passengers by survival status.

ggplot(titanic_data, aes(x = age, y = fare)) +
  geom_point() +
  facet_grid(. ~ survived) +
  xlab("Age") +
  ylab("Fare") +
  ggtitle("Scatter Plot of Age vs. Fare by Survival Status")

Facet Grid of Age vs. Fare by Passenger Class

This plot shows the relationship between age and fare, with separate facets for each passenger class.

ggplot(titanic_data, aes(x = age, y = fare)) +
  geom_point(color = "green") +
  facet_grid(. ~ pclass) +
  xlab("Age") +
  ylab("Fare") +
  ggtitle("Scatter Plot of Age vs. Fare by Passenger Class")

Combined Scatter Plot of Age vs. Fare by Passenger Class

This plot displays a scatter plot of age versus fare, color-coded by passenger class.

ggplot(titanic_data, aes(x = age, y = fare, color = pclass)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("1" = "red", "2" = "orange", "3" = "green")) +
  xlab("Age") +
  ylab("Fare") +
  ggtitle("Age vs. Fare by Passenger Class") +
  labs(color = "Passenger Class")

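Interactive Scatter Plot of Age vs. Fare by Passenger Class and Sex

This plot builds a ggplot2 scatter plot, colored by sex and faceted by passenger class, and passes it to plotly's ggplotly() so that it can be explored interactively by hovering, zooming, and panning.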

ggplotly(
  ggplot(titanic_data, aes(x = age, y = fare, color = sex)) +
    geom_point() +
    facet_wrap(~ pclass) +
    xlab("Age") +
    ylab("Fare") +
    ggtitle("Age vs. Fare by Passenger Class and Sex")
)

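Stacked Bar Plot of Survival by Passenger Class

This plot visualizes the survival proportions across passenger classes using a stacked bar plot. Setting position = "fill" scales every bar to a height of 1, so the segments show proportions within each class rather than raw counts.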

ggplot(titanic_data, aes(x = pclass, fill = survived)) +
  geom_bar(position = "fill") +
  xlab("Passenger Class") +
  ylab("Proportion") +
  labs(fill = "Survived") +
  ggtitle("Survival Proportions by Passenger Class")
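
The proportions that position = "fill" draws can be computed directly with a two-way table. A minimal base R sketch on toy data follows; margin = 1 normalizes within each row, which here means within each passenger class.

```r
# Toy data with the same column names as the cleaned dataset.
toy <- data.frame(
  pclass   = factor(c(1, 1, 2, 2, 3, 3, 3, 3)),
  survived = factor(c(1, 1, 1, 0, 0, 0, 0, 1))
)

# Counts of each survival status within each class.
counts <- table(toy$pclass, toy$survived)

# Row-wise proportions: each class sums to 1, like each filled bar.
props <- prop.table(counts, margin = 1)

round(props, 2)
```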

Interactive Stacked Bar Plot of Survival by Passenger Class

This interactive version of the stacked bar plot uses plotly, showing survival proportions by passenger class with hover information.

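# titanic_data has no percentage column, so compute the share of each
# survival status within each passenger class first.
survival_pct <- titanic_data %>%
  count(pclass, survived) %>%
  group_by(pclass) %>%
  mutate(percentage = 100 * n / sum(n)) %>%
  ungroup()

survival_plot <- plot_ly(survival_pct,
  x = ~pclass,
  y = ~percentage,
  type = 'bar',
  color = ~survived,
  text = ~paste('Status:', survived, '<br>Percentage:', round(percentage, 2), '%'),
  hoverinfo = 'text',
  textposition = 'auto') %>%
  layout(barmode = 'stack',
         xaxis = list(title = 'Passenger Class'),
         yaxis = list(title = 'Percentage'),
         title = 'Survival Proportions by Passenger Class',
         legend = list(title = list(text = 'Survival Status')))

# Print the object to render the interactive chart.
survival_plot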

Conclusion

This document walked through the process of cleaning the Titanic dataset and provided various visualizations to explore the data further. These plots reveal insights into survival rates, age distributions, and the relationships between different variables such as age, fare, and passenger class.